The code style we chose was BigCamelCase.
For our project, we are using two NFL data sets. We chose NFL data because we are both big NFL fans and are interested in the overall breakdown of statistics in football.
How does each statistic vary by position?
Which positions lead in which stats?
How do quarterbacks' stats compare between eras?
Which quarterback had the best season?
This data set is from Pro Football Reference. It contains the defensive statistics for all players who recorded at least one defensive stat in the 2023/24 NFL season. A case in this data set is an individual player. Our analysis of the defense data set focuses on most of the attributes: Position, G, GS, Int, TD, PD, FF, FR, Sk, Comb, Solo, Ast, TFL, QBHits, and Sfty.
library(readxl)  # provides read_excel()
DefenseStats <- read_excel("2023 NFL Defense Stats.xlsx")
We selected the columns we were interested in, excluding columns such as player name, team, and return yards, which were not needed. We filtered the Position column to include only defensive positions. We then grouped the data by position so that the table has one row per position, with each column holding the total of that stat across all players at that position.
# Keep the columns of interest, restrict to defensive positions,
# then total each stat by position.
DefS <- DefenseStats %>%
select(Position, G, GS, Int, TD, PD, FF, FR, Sk, Comb, Solo, Ast, TFL, QBHits, Sfty) %>%
filter(Position %in% c("DT", "DE", "LB", "CB", "S")) %>%
group_by(Position) %>%
summarize(
Games = sum(G),
Games_Started = sum(GS),
Interceptions = sum(Int),
Touchdowns = sum(TD),
Pass_Deflections = sum(PD),
Forced_Fumbles = sum(FF),
Fumble_Recoveries = sum(FR),
Sacks = sum(Sk),
Combined_Tackles = sum(Comb),
Solo_Tackles = sum(Solo),
Assisted_Tackles = sum(Ast),
Tackles_For_Loss = sum(TFL),
QB_Hits = sum(QBHits),
Safeties = sum(Sfty), .groups = "drop"
)
DefS %>%
kable(
caption = "Statistic Totals for Each NFL Position",
booktabs = TRUE,
align = rep("c", ncol(DefS))  # one alignment entry per column
) %>%
kable_styling(
bootstrap_options = c("striped"),
font_size = 16
)
| Position | Games | Games_Started | Interceptions | Touchdowns | Pass_Deflections | Forced_Fumbles | Fumble_Recoveries | Sacks | Combined_Tackles | Solo_Tackles | Assisted_Tackles | Tackles_For_Loss | QB_Hits | Safeties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CB | 3086 | 1482 | 190 | 26 | 1163 | 89 | 67 | 43.0 | 8270 | 6213 | 2057 | 297 | 100 | 1 |
| DE | 1211 | 515 | 2 | 1 | 104 | 50 | 25 | 329.5 | 2399 | 1420 | 979 | 429 | 730 | 1 |
| DT | 2508 | 1275 | 10 | 6 | 208 | 57 | 59 | 392.5 | 5047 | 2646 | 2401 | 641 | 1046 | 3 |
| LB | 3772 | 1670 | 78 | 18 | 461 | 153 | 106 | 567.0 | 12336 | 7523 | 4813 | 1056 | 1166 | 2 |
| S | 1810 | 1034 | 149 | 12 | 466 | 70 | 50 | 63.0 | 6719 | 4569 | 2150 | 225 | 128 | 0 |
In this table, each row is a position and each column is a stat. Each entry is the total of that stat across all players at that position.
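As a quick sanity check on the totals, combined tackles should equal solo plus assisted tackles for every position. The sketch below re-enters the three tackle columns from the table by hand; `PositionTotals` and `TacklesConsistent` are hypothetical names used only for this illustration.

```r
# Tackle totals copied from the position table above.
PositionTotals <- data.frame(
  Position = c("CB", "DE", "DT", "LB", "S"),
  Combined_Tackles = c(8270, 2399, 5047, 12336, 6719),
  Solo_Tackles = c(6213, 1420, 2646, 7523, 4569),
  Assisted_Tackles = c(2057, 979, 2401, 4813, 2150)
)

# TRUE for a position when Combined = Solo + Assisted.
TacklesConsistent <- with(
  PositionTotals,
  Combined_Tackles == Solo_Tackles + Assisted_Tackles
)

# Stops with an error if any position fails the check.
stopifnot(all(TacklesConsistent))
```

The same kind of check could be run on `DefS` directly instead of on hand-entered values.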
For the Defense Data, we created 14 data visualizations. Each data visualization has position as the x-axis and the stat as the y-axis. Each data visualization shows which position leads in that statistic.
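The 14 plots that follow share one template and differ only in the stat on the y-axis. A minimal sketch of how that template could be wrapped in a function is shown here. `PlotStatByPosition` and `DemoStats` are hypothetical names that are not part of our analysis, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical helper: draws the shared bar chart for the stat column
# whose name is passed as a string in `Stat`.
PlotStatByPosition <- function(Data, Stat) {
  StatLabel <- gsub("_", " ", Stat)  # "Games_Started" becomes "Games Started"
  ggplot(Data) +
    aes(x = Position, y = .data[[Stat]]) +
    geom_col(fill = "#212221") +
    labs(
      x = "Position",
      y = StatLabel,
      title = paste("Total", StatLabel, "by Position for the 2023/2024 NFL Season")
    ) +
    theme_light() +
    theme(
      plot.title = element_text(size = 15, face = "bold"),
      axis.title.y = element_text(face = "bold"),
      axis.title.x = element_text(face = "bold")
    )
}

# Small stand-in for DefS, using the Games column from the table above.
DemoStats <- data.frame(
  Position = c("CB", "DE", "DT", "LB", "S"),
  Games = c(3086, 1211, 2508, 3772, 1810)
)
GamesPlot <- PlotStatByPosition(DemoStats, "Games")
```

With the real data, a call such as `PlotStatByPosition(DefS, "Sacks")` would be expected to reproduce the corresponding chart.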
ggplot(DefS) +
aes(x = Position, y = Games) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Games",
title = "Total Games by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in games played and defensive ends have the fewest games played.
ggplot(DefS) +
aes(x = Position, y = Games_Started) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Games Started",
title = "Total Games Started by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in games started and defensive ends have the fewest games started.
ggplot(DefS) +
aes(x = Position, y = Interceptions) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Interceptions",
title = "Total Interceptions by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that cornerbacks lead in interceptions and defensive ends have the fewest interceptions.
ggplot(DefS) +
aes(x = Position, y = Touchdowns) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Touchdowns",
title = "Total Touchdowns by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that cornerbacks lead in touchdowns and defensive ends have the fewest touchdowns.
ggplot(DefS) +
aes(x = Position, y = Pass_Deflections) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Pass Deflections",
title = "Total Pass Deflections by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that cornerbacks lead in pass deflections and defensive ends have the fewest pass deflections.
ggplot(DefS) +
aes(x = Position, y = Forced_Fumbles) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Forced Fumbles",
title = "Total Forced Fumbles by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in forced fumbles and defensive ends have the fewest forced fumbles.
ggplot(DefS) +
aes(x = Position, y = Fumble_Recoveries) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Fumble Recoveries",
title = "Total Fumble Recoveries by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in fumble recoveries and defensive ends have the fewest fumble recoveries.
ggplot(DefS) +
aes(x = Position, y = Sacks) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Sacks",
title = "Total Sacks by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in sacks and cornerbacks have the fewest sacks.
ggplot(DefS) +
aes(x = Position, y = Combined_Tackles) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Combined Tackles",
title = "Total Combined Tackles by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in combined tackles and defensive ends have the fewest combined tackles.
ggplot(DefS) +
aes(x = Position, y = Solo_Tackles) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Solo Tackles",
title = "Total Solo Tackles by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in solo tackles and defensive ends have the fewest solo tackles.
ggplot(DefS) +
aes(x = Position, y = Assisted_Tackles) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Assisted Tackles",
title = "Total Assisted Tackles by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in assisted tackles and defensive ends have the fewest assisted tackles.
ggplot(DefS) +
aes(x = Position, y = Tackles_For_Loss) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Tackles For Loss",
title = "Total Tackles For Loss by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in tackles for loss and safeties have the fewest tackles for loss.
ggplot(DefS) +
aes(x = Position, y = QB_Hits) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "QB Hits",
title = "Total QB Hits by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that linebackers lead in quarterback hits and cornerbacks have the fewest quarterback hits.
ggplot(DefS) +
aes(x = Position, y = Safeties) +
geom_col(fill = "#212221") +
labs(
x = "Position",
y = "Safeties",
title = "Total Safeties by Position for the 2023/2024 NFL Season"
) +
theme_light() +
theme(
plot.title = element_text(size = 15L,
face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")
)
This graph shows that defensive tackles lead in safeties scored, while players at the safety position recorded none.
Overall, linebackers lead in most of the statistics and defensive ends rank last in most of them. This makes sense, as linebackers have the most games played and started and defensive ends have the fewest.
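Because the totals depend on how many games each position group played, one way to examine that relationship is to divide each total by games played. The sketch below does this for sacks and combined tackles using values copied from the table above. `PositionTotals`, `PerGame`, and the two per-game columns are hypothetical names, and `Games` here is the sum of games across all players at a position.

```r
# Totals copied from the position table above.
PositionTotals <- data.frame(
  Position = c("CB", "DE", "DT", "LB", "S"),
  Games = c(3086, 1211, 2508, 3772, 1810),
  Sacks = c(43.0, 329.5, 392.5, 567.0, 63.0),
  Combined_Tackles = c(8270, 2399, 5047, 12336, 6719)
)

# Divide each total by games played so that positions with more
# player-games are not favored automatically.
PerGame <- transform(
  PositionTotals,
  Sacks_Per_Game = Sacks / Games,
  Tackles_Per_Game = Combined_Tackles / Games
)

# Show positions ordered by sacks per game, highest first.
PerGame[order(-PerGame$Sacks_Per_Game),
        c("Position", "Sacks_Per_Game", "Tackles_Per_Game")]
```

On these totals, defensive ends have the highest sacks per game even though linebackers have the most total sacks, which suggests that part of the linebackers' lead comes from volume of games.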
This data set is from Kaggle and contains every quarterback season from 1970 to 2022. It includes season stats such as passing yards, touchdowns, and interceptions. The table contains many more stats that I did not consider important for comparison.
Stats <- read_excel("NFL QB Stats.xlsx")
The Quarterbacks table required quite a bit of wrangling. I began by selecting the top 20 quarterbacks by passing yards in each year, to limit the effect of human variability and of quarterback injuries during the season. The rest of the wrangling falls into two categories, Yards and Points.
Starters <- Stats%>%
group_by(Year)%>%
arrange(desc(`Pass Yds`))%>%
filter(row_number()<21)%>%
arrange(desc(Year))
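An equivalent way to express the top-N-per-year selection is `slice_max()`, available in dplyr 1.0.0 and later. The sketch below uses hypothetical toy data (`ToyStats`) and keeps the top 2 passers per year rather than the top 20; it is an illustration, not the code used for the results in this report.

```r
library(dplyr)

# Hypothetical toy data: three quarterbacks in each of two seasons.
ToyStats <- tibble(
  Year = c(2021, 2021, 2021, 2022, 2022, 2022),
  Player = c("A", "B", "C", "D", "E", "F"),
  `Pass Yds` = c(4100, 3900, 2500, 4800, 3100, 4500)
)

# Keep the top 2 passers within each year, then drop the grouping
# so later steps operate on the whole table.
ToyStarters <- ToyStats %>%
  group_by(Year) %>%
  slice_max(`Pass Yds`, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(Year))
```

Setting `with_ties = FALSE` keeps exactly `n` rows per year, which matches the behavior of filtering on `row_number()`.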
Many people judge a quarterback's greatness mostly by the yards he threw for. The first wrangling I did for yards was to compute the average passing yards per year and merge that table onto the Starters table. I then added the number of games played per season, because the schedule length has changed twice since 1970 and that context is necessary for comparisons. With those supporting columns in place, I created the main stat, Average Passing Yards+ (APY+), a standardized measure of passing yards relative to the yearly average, and added it to the Starters table.
Era<- Starters%>%
group_by(Year)%>%
filter(Year != 1982)%>%
summarise(`Year Mean` = mean(`Pass Yds`))
Temp<-Era%>%
mutate(`Games per Year` = if_else(Year<1978,"14 Games","16 Games")) # the 16-game schedule began in 1978
Temp2<-replace(Temp$`Games per Year`,Temp$Year>2020,"17 Games")
NewEra<-Era%>%
mutate(`Games per Year` = Temp2)
MergedYards<-merge(x= Starters, y = Era, by = "Year", all.x = T)%>%
select(Year,Player, `Pass Yds`, TD, INT, `Year Mean`)
# reframe() is used because the result keeps one row per quarterback season
AdjYards<- MergedYards%>%
reframe(Year = as.character(Year), Player = Player, TD = TD, INT = INT, `APY (Average Passing Yards)` = `Year Mean`, Yards = `Pass Yds`, `APY+` = round(100*(`Pass Yds`/`Year Mean`),0))%>%
arrange(desc(`APY+`))
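The APY+ calculation above expresses a season's passing yards as a percentage of that year's average among the selected starters, so 100 means exactly average. A minimal worked sketch with hypothetical numbers follows; `ApyPlus` is an illustrative helper, not part of the wrangling code.

```r
# Hypothetical helper: passing yards as a percentage of the yearly mean.
ApyPlus <- function(PassYds, YearMean) {
  round(100 * (PassYds / YearMean), 0)
}

# Hypothetical seasons from two different eras.
OlderSeason <- ApyPlus(3000, 2000)  # 3,000 yards when the average was 2,000
NewerSeason <- ApyPlus(4500, 3000)  # 4,500 yards when the average was 3,000

OlderSeason  # 150
NewerSeason  # 150
```

Both hypothetical seasons score 150, even though the raw yardage differs by 1,500 yards, which is the sense in which APY+ adjusts for era.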
The Points variable lets us compare quarterbacks across years using more than just passing yards. Each passing yard is worth 1 point, each touchdown is worth 100 points, and each interception is worth -50 points. The remaining wrangling steps are the same as above, except that the main stat of comparison is the new Points stat rather than passing yards.
Points<-Starters%>%
group_by(Year)%>%
reframe(Year = Year, Player = Player, Yards = `Pass Yds`,TD = TD, INT = INT, Points = ((`Pass Yds`)+(TD*100)-(INT*50)))
EraPoints<- Points%>%
group_by(Year)%>%
filter(Year != 1982)%>%
summarise(`Year Mean` = mean(Points))
NewEraPoints<-EraPoints%>%
mutate(`Games per Year` = Temp2)
MergedPoints<-merge(x= Points, y = NewEraPoints, by = "Year", all.x = T)%>%
select(Year,Player, Points, TD, INT, `Year Mean`,`Games per Year`)
# reframe() is used because the result keeps one row per quarterback season
AdjPoints<- MergedPoints%>%
reframe(Year = as.character(Year), Player = Player, TD = TD, INT = INT, `AP (Average Points)` = `Year Mean`,
Points = Points, `AP+` = round(100*(Points/`Year Mean`),0), `Games per Year` = `Games per Year`)%>%
arrange(desc(`AP+`))
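The Points formula and its AP+ standardization can be checked by hand on a single season. The sketch below uses hypothetical numbers; `QbPoints` and `ApPlus` are illustrative helpers that mirror the arithmetic in the wrangling code above.

```r
# Hypothetical helper: 1 point per yard, 100 per touchdown, -50 per interception.
QbPoints <- function(Yards, TD, INT) {
  Yards + (TD * 100) - (INT * 50)
}

# Hypothetical helper: points as a percentage of the yearly mean.
ApPlus <- function(Points, YearMean) {
  round(100 * (Points / YearMean), 0)
}

# Hypothetical season: 4,000 yards, 30 touchdowns, 10 interceptions.
SeasonPoints <- QbPoints(4000, 30, 10)  # 4000 + 3000 - 500 = 6500

# Hypothetical yearly mean of 5,000 points among the starters.
SeasonApPlus <- ApPlus(SeasonPoints, 5000)  # 130
```

In this weighting, one interception cancels half a touchdown, so a high-yardage season with many interceptions can rank lower in AP+ than in APY+.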
The tables below show the best era-adjusted quarterback seasons since 1970. I selected the top 5 by Average Passing Yards+ (APY+) and by Average Points+ (AP+). The difference between the tables is the main reason you cannot focus strictly on passing yards, as some people do in debates: the top APY+ seasons do not line up with the top AP+ seasons, which shows the importance of a quarterback's other stats.
AdjYardTable<- AdjYards%>%
select(Year, Player, TD, INT ,`APY+`)
head(AdjYardTable,5)%>%
kable(caption = "Best APY+ since 1970",
align = c("l", rep("c", 4))) %>%
kable_styling(
bootstrap_options = c("striped"),
font_size = 16)
| Year | Player | TD | INT | APY+ |
|---|---|---|---|---|
| 1973 | Roman Gabriel | 23 | 12 | 172 |
| 1984 | Dan Marino | 48 | 17 | 158 |
| 1991 | Warren Moon | 23 | 21 | 150 |
| 1990 | Warren Moon | 33 | 13 | 149 |
| 1971 | John Hadl | 21 | 25 | 147 |
AdjPointTable<- AdjPoints%>%
select(Year, Player, TD, INT ,`AP+`)
head(AdjPointTable,5)%>%
kable(caption = "Best AP+ since 1970",
align = c("l", rep("c", 4))) %>%
kable_styling(
bootstrap_options = c("striped"),
font_size = 16)
| Year | Player | TD | INT | AP+ |
|---|---|---|---|---|
| 1984 | Dan Marino | 48 | 17 | 203 |
| 1973 | Roman Gabriel | 23 | 12 | 190 |
| 1986 | Dan Marino | 44 | 23 | 189 |
| 2007 | Tom Brady | 50 | 8 | 181 |
| 2013 | Peyton Manning | 55 | 10 | 176 |
The visualizations show why it is so difficult to compare players between eras, and who the most effective players were in each statistical category.
The first visualizations to look at are the year-to-year comparisons of average passing yards and average points. As the plots below show, the average has more than doubled from 1970 to 2022, and a good season in the 1970s would be a horrible season by today’s standards. This is why it is so difficult to compare players who did not play at the same time.
ggplot(NewEra) +
aes(x = Year, y = `Year Mean`, colour = `Games per Year`) +
geom_point(shape = "circle", size = 2L) +
scale_color_brewer(palette = "Set1", direction = 1) +
labs(x = "Year", y = "Average Passing Yards") +
theme_minimal() +
theme(
plot.title = element_text(size = 30L,
hjust = 0.5),
axis.title.y = element_text(size = 14L),
axis.title.x = element_text(size = 14L)
)
ggplot(NewEraPoints) +
aes(x = Year, y = `Year Mean`, colour = `Games per Year`) +
geom_point(shape = "circle", size = 2L) +
scale_color_brewer(palette = "Set1", direction = 1) +
labs(x = "Year", y = "Average Points") +
theme_minimal() +
theme(
plot.title = element_text(size = 30L,
hjust = 0.5),
axis.title.y = element_text(size = 14L),
axis.title.x = element_text(size = 14L)
)
These visualizations present the data tables above as bar plots.
#Best APY+ Barplot
ggplot(head(AdjYardTable, 5)) +
aes(x = Year, y = `APY+`, fill = Player) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = 1) +
labs(
x = "Year",
y = "APY+",
title = "Best APY+ since 1970",
fill = "Name"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 30L),
axis.title.y = element_text(size = 14L),
axis.title.x = element_text(size = 14L)
)
#Best AP+ Barplot
ggplot(head(AdjPointTable, 5)) +
aes(x = Year, y = `AP+`, fill = Player) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = 1) +
labs(
x = "Year",
y = "AP+",
title = "Best AP+ since 1970",
fill = "Name"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 30L),
axis.title.y = element_text(size = 14L),
axis.title.x = element_text(size = 14L)
)